A New Partitioning Around Medoids Algorithm

Authors

  • Mark J. van der Laan
  • Katherine S. Pollard
  • Jennifer Bryan

Abstract

Kaufman & Rousseeuw (1990) proposed the clustering algorithm Partitioning Around Medoids (PAM), which maps a distance matrix into a specified number of clusters. A particularly nice property is that PAM allows clustering with respect to any specified distance metric. In addition, the medoids are robust representations of the cluster centers, which is particularly important in the common situation where many elements do not belong well to any cluster. Based on our experience in clustering gene expression data, we have noticed that PAM has problems recognizing relatively small clusters in situations where good partitions around medoids clearly exist. In this note, we propose to partition around medoids by maximizing the criterion “Average Silhouette” defined by Kaufman & Rousseeuw. We also propose a fast-to-compute approximation of “Average Silhouette”. We implement these two new partitioning around medoids algorithms and illustrate their performance relative to existing partitioning methods in simulations.

1 A new partitioning around medoids algorithm

Suppose that one is interested in clustering $p$ elements $x_j$, $j \in \{1, \dots, p\}$, and that each element $x_j$ is an $n$-dimensional vector $(x_{1j}, \dots, x_{nj})^T$. We have encountered this problem in the gene expression context, where each element is a gene whose relative expression has been measured across a variety of experiments or patients. Other contexts in which clustering has been applied include environmental studies, astronomy, and digit recognition.

Let $d(x_j, x_{j'})$ denote the dissimilarity between elements $j$ and $j'$, and let $D$ be the $p \times p$ symmetric matrix of dissimilarities. Typical choices of dissimilarity include Euclidean distance, 1 minus correlation, 1 minus absolute correlation, and 1 minus cosine-angle. For example, the cosine-angle distance between two vectors was used in Eisen et al. [1998] to cluster genes based on gene expression data across a variety of cell lines.
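
The dissimilarity choices listed above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `dissimilarity_matrix`, the `kind` labels, and the toy matrix `X` are hypothetical, and the data are stored with one row per element (so `X` is $p \times n$).

```python
import numpy as np

def dissimilarity_matrix(X, kind="euclidean"):
    """Return the p x p dissimilarity matrix D for the rows of X.

    Assumption: X has one row per element (p rows, n measurements each).
    """
    X = np.asarray(X, dtype=float)
    if kind == "euclidean":
        diff = X[:, None, :] - X[None, :, :]      # all pairwise differences
        return np.sqrt((diff ** 2).sum(axis=2))
    if kind == "correlation":                     # 1 minus correlation
        return 1.0 - np.corrcoef(X)
    if kind == "abs_correlation":                 # 1 minus absolute correlation
        return 1.0 - np.abs(np.corrcoef(X))
    if kind == "cosine":                          # 1 minus cosine-angle
        U = X / np.linalg.norm(X, axis=1, keepdims=True)
        return 1.0 - U @ U.T
    raise ValueError(f"unknown dissimilarity: {kind}")

# Toy data: rows 0 and 2 are perfectly anti-correlated.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.5],
              [4.0, 3.0, 2.0, 1.0]])

D_corr = dissimilarity_matrix(X, "correlation")
D_abs = dissimilarity_matrix(X, "abs_correlation")
```

In this toy example the anti-correlated rows 0 and 2 are as far apart as possible under 1 minus correlation (`D_corr[0, 2]` is 2 up to rounding) but essentially identical under 1 minus absolute correlation (`D_abs[0, 2]` is 0 up to rounding).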

It is of interest to note that the cosine-angle distance equals 0.5 times the squared Euclidean distance between the two vectors after each has been standardized to have Euclidean norm 1, since $\|u - v\|^2 = 2 - 2u^T v$ for unit vectors $u$ and $v$.

The clustering procedure PAM [Kaufman and Rousseeuw, 1990, chap. 2] takes as input such a dissimilarity matrix $D$ and produces as output a set of cluster centers or “medoids”. Let $K$ be the number of clusters and let $M = (M_1, \dots, M_K)$ denote any size-$K$ collection of the $p$ elements $x_j$. Given $M$, we can calculate the dissimilarity $d(x_j, M_k)$ between each element and each member of $M$. For each element $x_j$, we denote the minimum and the minimizer by

$$\min_{k=1,\dots,K} d(x_j, M_k) = d_1(x_j, M) \quad \text{and} \quad \arg\min_{k=1,\dots,K} d(x_j, M_k) = l_1(x_j, M).$$

PAM selects the medoids $M^*$ by minimizing the sum of such distances:

$$M^* = \arg\min_{M} \sum_j d_1(x_j, M).$$

Each medoid $M^*_k$ identifies a cluster, defined as the elements which are closer to this medoid than to any other. This clustering is captured by a vector of labels $l(X, M^*) = (l_1(x_1, M^*), \dots, l_1(x_p, M^*))$. One can consider $K$ as given, or it can be selected data-adaptively, for example by maximizing the average silhouette as recommended by Kaufman and Rousseeuw.

The silhouette for a given element is calculated as follows. For each gene $j$, calculate $a_j$, the average dissimilarity of gene $j$ with the other elements of its cluster:

$$a_j = \operatorname{avg}\, d(x_j, x_{j'}), \quad j' \in \{i : l_1(x_i, M) = l_1(x_j, M)\}.$$

For each gene $j$ and each cluster $k$ to which it does not belong (that is, $k \neq l_1(x_j, M)$), calculate $b_{jk}$, the average dissimilarity of gene $j$ with the members of cluster $k$:

$$b_{jk} = \operatorname{avg}\, d(x_j, x_{j'}), \quad j' \in \{i : l_1(x_i, M) = k\}.$$

Let $b_j = \min_k b_{jk}$. The silhouette of gene $j$ is defined by the formula

$$S_j(M) = \frac{b_j - a_j}{\max(a_j, b_j)}. \qquad (1)$$

Note that the largest possible silhouette is 1, which occurs only if there is no dissimilarity within gene $j$'s cluster (i.e., $a_j = 0$). The other extreme is $-1$.
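
The definitions above (labels $l_1$, the PAM criterion $\sum_j d_1(x_j, M)$, and the silhouette in formula (1)) can be made concrete with a small sketch. The function names and the one-dimensional toy data are hypothetical. Following the wording "other elements of its cluster", element $j$ itself is left out of the average $a_j$, and, as an assumed convention, an element alone in its cluster gets silhouette 0.

```python
import numpy as np

def assign_labels(D, medoids):
    """l_1(x_j, M): position k of the closest medoid, for every element j."""
    return np.argmin(D[:, medoids], axis=1)

def pam_objective(D, medoids):
    """Sum over j of d_1(x_j, M), the quantity PAM minimizes."""
    return D[:, medoids].min(axis=1).sum()

def silhouettes(D, labels):
    """S_j = (b_j - a_j) / max(a_j, b_j) for every element j."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("silhouettes need at least two non-empty clusters")
    S = np.zeros(len(labels))
    for j in range(len(labels)):
        same = labels == labels[j]
        same[j] = False                      # exclude element j itself
        if not same.any():
            continue                         # singleton cluster: keep 0 (assumed convention)
        a = D[j, same].mean()
        b = min(D[j, labels == k].mean() for k in clusters if k != labels[j])
        if max(a, b) > 0:
            S[j] = (b - a) / max(a, b)
    return S

# Toy example: five points on a line, absolute difference as dissimilarity.
values = np.array([0.0, 1.0, 2.0, 10.0, 11.0])
D = np.abs(values[:, None] - values[None, :])
medoids = [1, 3]                             # the elements with values 1 and 10

labels = assign_labels(D, medoids)           # [0, 0, 0, 1, 1]
objective = pam_objective(D, medoids)        # 1 + 0 + 1 + 0 + 1 = 3.0
S = silhouettes(D, labels)
average_silhouette = S.mean()
```

For the last point (value 11), $a_j = 1$ and $b_j = (11 + 10 + 9)/3 = 10$, so its silhouette is $(10 - 1)/10 = 0.9$.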

Heuristically, the silhouette measures how well matched an object is to the other objects in its own cluster versus how well matched it would be if it were moved to another cluster. It has also been our experience, based on simulated and real gene expression data, that the average silhouette is actually a very good measure of the strength of clustering results; see also [Fridlyand, 2001] for a favorable performance of average silhouette relative to other validation functionals.

PAM has several favorable properties. Since PAM performs clustering with respect to any specified distance metric, it allows a flexible definition of what it means for two elements to be “close”. We have found that this flexibility is particularly important in biological applications where researchers may be interested, for example, in grouping correlated or possibly also anti-correlated elements. Many clustering algorithms do not allow for a flexible definition of similarity. KMEANS, for example, could be performed with respect to any metric, but allows only Euclidean and Manhattan distance in current implementations of which we are aware.

In addition to allowing a flexible distance metric, PAM has the advantage of identifying clusters by the medoids. Medoids are robust representations of the cluster centers that are less sensitive to outliers than other cluster profiles, such as the cluster means of KMEANS. This robustness is particularly important in the common context that many elements do not belong well to any cluster.

We have found some cases, in both real and simulated data, where existing clustering routines fail to find the main clusters. The problem of finding relatively small clusters in the presence of one or more larger clusters is particularly hard. In this situation, PAM often does not succeed in finding a sensible set of medoids. This criticism is just as valid (if not more so) for KMEANS and Self-Organizing Maps (SOMs).
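
The robustness of medoids relative to cluster means, mentioned above, can be seen in a tiny single-cluster example. The data and the helper `medoid_index` are hypothetical. Here the medoid is taken to be the element with the smallest total dissimilarity to all others, which is the $K = 1$ case of the PAM criterion.

```python
import numpy as np

def medoid_index(D):
    """Index of the element with the smallest total dissimilarity to the rest."""
    return int(np.argmin(D.sum(axis=1)))

# One cluster of four nearby values plus a single outlier.
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
D = np.abs(values[:, None] - values[None, :])

mean_center = values.mean()                  # 22.0, pulled toward the outlier
medoid_center = values[medoid_index(D)]      # 3.0, an actual cluster member
```

The mean lands at 22, far from every observed value, while the medoid stays at 3, inside the bulk of the data.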

Inspired by this lack of performance, we present a modification of PAM that maximizes the average silhouette over all potential medoids. Given the number of clusters $K$, PAM maximizes over all potential medoids $M$ the function $f(M) = -\sum$ ...
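
As a rough illustration of partitioning around medoids under different criteria, the sketch below searches exhaustively over all size-$K$ medoid sets. It is not the authors' algorithm: brute-force search is only feasible for very small $p$, and the function names, the toy data, and the conventions for singleton or empty clusters are assumptions. The PAM criterion is written as the negative sum of distances to the closest medoid, following the definition of $M^*$ given earlier.

```python
from itertools import combinations
import numpy as np

def neg_sum_distance(D, medoids):
    """PAM criterion to maximize: minus the sum of d_1(x_j, M)."""
    return -D[:, list(medoids)].min(axis=1).sum()

def average_silhouette(D, medoids):
    """Average of S_j(M) over all elements for the partition induced by M."""
    labels = np.argmin(D[:, list(medoids)], axis=1)
    clusters = np.unique(labels)
    if len(clusters) < 2:
        return float("-inf")                 # assumed convention: never prefer such a set
    total = 0.0
    for j in range(len(labels)):
        same = labels == labels[j]
        same[j] = False                      # exclude element j itself
        if not same.any():
            continue                         # singleton cluster contributes 0 (assumption)
        a = D[j, same].mean()
        b = min(D[j, labels == k].mean() for k in clusters if k != labels[j])
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)

def best_medoids(D, K, criterion):
    """Exhaustive search: the size-K index set that maximizes criterion(D, M)."""
    return max(combinations(range(D.shape[0]), K),
               key=lambda M: criterion(D, M))

# Toy example: five points on a line, absolute difference as dissimilarity.
values = np.array([0.0, 1.0, 2.0, 10.0, 11.0])
D = np.abs(values[:, None] - values[None, :])

M_pam = best_medoids(D, 2, neg_sum_distance)
M_sil = best_medoids(D, 2, average_silhouette)
labels_pam = np.argmin(D[:, list(M_pam)], axis=1)
labels_sil = np.argmin(D[:, list(M_sil)], axis=1)
```

On this well-separated toy data both criteria recover the same partition, {0, 1, 2} and {10, 11}. The example only shows how the two criteria plug into the same search and does not reproduce the small-cluster failures described in the text.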




Publication date: 2002